3 research outputs found

    Performance Evaluation Analysis of Spark Streaming Backpressure for Data-Intensive Pipelines

    A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement has led to the emergence of several Big Data technologies for in-memory processing, such as Apache Storm, Spark, Heron, Samza, and Flink. Spark Streaming, a widespread open-source implementation, processes data-intensive applications that often require large amounts of memory. However, the Spark Unified Memory Manager cannot properly handle sudden or intensive data surges and their related in-memory caching needs, resulting in performance and throughput degradation, high latency, a large number of garbage collection operations, out-of-memory issues, and data loss. This work presents a comprehensive performance evaluation of Spark Streaming backpressure to investigate the hypothesis that it could support data-intensive pipelines under specific pressure requirements. The results reveal that backpressure is suitable only for small and medium pipelines, for both stateless and stateful applications. Furthermore, the evaluation points out the Spark Streaming limitations that lead to in-memory issues for data-intensive pipelines and stateful applications, and indicates potential solutions.
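
    The paper evaluates Spark Streaming's built-in backpressure mechanism rather than introducing a new API. As a point of reference, the sketch below shows how that mechanism is typically enabled through standard Spark configuration properties; the rate values are illustrative assumptions, not settings taken from the paper.

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object BackpressureExample {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("BackpressureExample")
              // Enable the feedback-based rate controller so ingestion slows
              // down when batches take longer than the batch interval.
              .set("spark.streaming.backpressure.enabled", "true")
              // Rate used for the first batch, before feedback is available.
              .set("spark.streaming.backpressure.initialRate", "10000")
              // Hard upper bound per receiver, regardless of backpressure.
              .set("spark.streaming.receiver.maxRate", "50000")

            val ssc = new StreamingContext(conf, Seconds(5))
            // At least one input stream and output operation must be
            // registered here before start() is called.
            ssc.start()
            ssc.awaitTermination()
          }
        }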

    Boosting big data streaming applications in clouds with burstFlow

    The rapid growth of stream applications in financial markets, health care, education, social media, and sensor networks represents a remarkable milestone for data processing and analytics in recent years, leading to new challenges for handling Big Data in real time. Traditionally, a single cloud infrastructure holds the deployment of stream processing applications because it offers extensive and adaptive virtual computing resources. Data sources therefore send data from locations distant from the cloud infrastructure, increasing application latency. The cloud infrastructure may be geographically distributed, requiring a set of frameworks to handle communication; these frameworks often comprise a message queue system and a stream processing framework. Such frameworks exploit multi-cloud deployments, running each service in a different cloud and communicating over high-latency network links. This creates challenges for meeting real-time application requirements, because the data streams have different and unpredictable latencies, forcing cloud providers' communication systems to adjust continually to environmental changes. Previous works explore static micro-batching, demonstrating its potential to overcome communication issues. This paper introduces BurstFlow, a tool for enhancing communication between data sources located at the edges of the Internet and Big Data stream processing applications located in cloud infrastructures. BurstFlow introduces a strategy for adjusting micro-batch sizes dynamically according to the time required for communication and computation. BurstFlow also presents an adaptive data partition policy for distributing incoming streams across the available machines by considering their memory and CPU capacities. The experiments use a real-world multi-cloud deployment and show that BurstFlow can reduce execution time by up to 77% compared to state-of-the-art solutions, improving CPU efficiency by up to 49%.
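
    The abstract does not give BurstFlow's actual resizing rule, so the sketch below is illustrative only: a minimal feedback loop of the kind described, which grows the micro-batch when communication time dominates (to amortize per-batch link overhead) and shrinks it when processing falls behind. All names and thresholds are hypothetical.

        /** Illustrative only: not BurstFlow's implementation. */
        object MicroBatchSizer {
          val MinBatch = 100      // assumed lower bound on records per batch
          val MaxBatch = 100000   // assumed upper bound on records per batch

          /** Returns the next micro-batch size given the last batch's timings. */
          def nextBatchSize(current: Int, commMs: Long, computeMs: Long,
                            budgetMs: Long): Int = {
            val totalMs = commMs + computeMs
            val resized =
              if (totalMs > budgetMs)      (current * 0.8).toInt   // falling behind: shrink
              else if (commMs > computeMs) (current * 1.25).toInt  // latency-bound: grow
              else current                                         // balanced: keep size
            math.max(MinBatch, math.min(MaxBatch, resized))
          }
        }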

    Cache based global memory orchestration for data intensive stream processing pipelines

    A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several Big Data in-memory, data-intensive processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied domains such as financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. These streaming systems rely on the Java Virtual Machine (JVM) as an underlying processing environment for platform independence. Although it provides high-level hardware abstraction, the JVM cannot efficiently manage applications that cache data into the JVM heap intensively. Consequently, this may lead to data loss, throughput degradation, and high latency due to several processing overheads induced by data deserialization, object scattering in main memory, garbage collection operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time stream processing, since it represents a critical aspect of performance. Proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions unaware of the streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable. The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches provided by the French Grid'5000 testbed. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and can reduce global JVM heap utilization by up to 50% in the evaluated cases.
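
    The thesis's orchestration model is not detailed in the abstract; the sketch below only illustrates the underlying primitive it relies on, namely observing global JVM heap utilization through the standard java.lang.management API and checking it against a memory utilization policy. The watermark value is an assumption, not a policy from the thesis.

        import java.lang.management.ManagementFactory

        /** Illustrative only: not the thesis's orchestration model. */
        object HeapPolicy {
          private val heap = ManagementFactory.getMemoryMXBean

          /** Fraction of the maximum heap currently in use. */
          def heapUtilization(): Double = {
            val usage = heap.getHeapMemoryUsage
            usage.getUsed.toDouble / usage.getMax.toDouble
          }

          /** True when utilization crosses the eviction watermark (assumed 0.75),
            * i.e., when a cache-eviction or throttling policy should fire. */
          def shouldEvict(watermark: Double = 0.75): Boolean =
            heapUtilization() > watermark
        }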